SQL Data Cleaning

In this SQL project, I use sql to clean data that can later be used for further analysis. Here is a step to step walk through of the project.

  1. Creating a Database and Importing Data
  2. I started by creating a new database and importing the data.

    Fullscreen Image

  3. Data Overview
  4. I then had an overall overview of the data to understand it and to determine the what, where and how it would be cleaned.

    Fullscreen Image

  5. Date Standardization
  6. I then standardized the date column. This involved:

    • Converting the Sale Date column from Date-time format to date format. This however does not alter the column.
    • Adding a new column "sale_date_converted" to add the converted Sale Date Column.
    • Updating the new sale_date_converted column. We will later drop the SaleDate column.
    • Checking to see whether the new column has been added correctly.

    Fullscreen Image

  7. Populating null/missing values in the Property Address column
  8. I then went ahead to populate the missing values in the property address column. I observed that for the missing values, there was a similar row entry with the same parcel ID but different Unique IDs. So the next step was to populate the missing property address with the available property address of the other entry with the same parcel ID as they should be similar.

    Fullscreen Image

  9. Splitting the Property Address Column
  10. I then went ahead to split the property Address Column into the actual address and the associated city which are in one column.

    Fullscreen Image

  11. Splitting the Owner Address Column
  12. I then split the owner Address Column

    Fullscreen Image

  13. SoldAsVacant column standardization
  14. The next step was to standardize the SoldAsVacant column. The first step was to find out the values present and their count. This would give us a basis on how to standardize it based on the entries with the highest count. Since 'Yes' and 'No' have the highest count, we convert the Y and N into Yes and No respectively. This also makes it more understandable as Y and N can be confusing and open to misinterpretation on what they could mean.

    Fullscreen Image

  15. Removing Duplicates
  16. I then went ahead to remove duplicates in our value. This however is not standard practice and should be done with care. Alternatively, creating a working sheet from the original data for such manipulation is advisable while reserving the original data as it is.

    Fullscreen Image

  17. Dropping Unused Columns
  18. Finally, I finished off by deleting unused columns. This columns are Owner Address, TaxDistrict, PropertyAdress and SaleDate columns.

    Fullscreen Image

    To download and view the full project on GitHub, click here.